Optimizing Matrix-Matrix Multiplication for an Embedded VLIW Processor
Abstract
The optimization of matrix-matrix multiplication (MMM) performance has been well studied on conventional general-purpose processors like the Intel Pentium 4. Fast algorithms, such as those in the Goto and ATLAS BLAS libraries, exploit common microarchitectural features, including superscalar execution and the cache and TLB hierarchy, to achieve near-peak performance. However, embedded processors typically use explicitly parallel in-order execution and have configurable memory hierarchies. Thus, approaches that find good MMM code for processors like the Pentium may not be as effective for embedded processors. For this project, I investigated the methods needed to achieve high-performance MMM on an embedded VLIW (very long instruction word) processor, the Texas Instruments C6713 floating-point DSP. This processor has three distinguishing features that affect an MMM implementation: an 8-wide in-order pipeline, L2 mapped RAM (i.e., a software-controlled scratch pad), and a direct memory access (DMA) engine. I present MMM implementations, obtained through search and a model-driven approach, that leverage the DSP microarchitecture. By using the scratch pad and DMA, I observed a 51% performance increase over a blocked MMM implementation.
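For orientation, the blocked MMM baseline that the abstract compares against can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names `mmm_blocked` and `mmm_naive`, the row-major layout, and the tile size `TILE` are all assumptions, since the abstract does not give the block sizes found by search or by the model. The DMA transfers into the scratch pad are not shown because they depend on TI-specific interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size; the abstract does not state the values used. */
#define TILE 4

/* Return the smaller of two sizes (used to clip tiles at the matrix edge). */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* C += A * B for n x n row-major matrices, computed tile by tile so that
 * each TILE x TILE working set can stay in fast memory while it is reused. */
void mmm_blocked(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_size(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_size(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_size(jj + TILE, n);
                /* Multiply one tile of A by one tile of B into a tile of C. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a_ik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

/* Unblocked triple loop, included only as a correctness reference. */
void mmm_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. On a scratch-pad architecture, a natural extension is to have the DMA engine copy the next tiles into on-chip memory while the current tiles are being multiplied, but that part is left out of this sketch.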
Similar resources
A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure
The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication that could run on a Fibonacci Hypercube structure. Most of the popular algorithms for parallel matrix multiplication cannot run on a Fibonacci Hypercube structure; therefore, a method that can run on all structures, especially the Fibonacci Hypercube structure, is necessary for parallel matr...
Designing Hardware/Software Systems for Embedded High-Performance Computing
In this work, we propose an architecture and methodology to design hardware/software systems for high-performance embedded computing on FPGA. The hardware side is based on a many-core architecture whose design is generated automatically given a set of architectural parameters. Both the architecture and the methodology were evaluated running dense matrix multiplication and sparse matrix-vector mu...
Co-design of Compiler and Hardware Techniques to Reduce Program Code Size on a VLIW Processor
Code size is a primary concern in the embedded computing community. Minimizing physical memory requirements reduces total system cost and improves performance and power efficiency. VLIW processors rely on the compiler to statically encode the ILP in the program before its execution, and because of this, code size is larger relative to other processors. In this paper we describe the co-design of...
Matrix-Matrix Multiplications and Fault Tolerance on Hypercube Multiprocessors
Several new algorithms for matrix-matrix multiplications on hypercube multiprocessors are presented and evaluated based on the number of multiplications, additions, and transfers. The matrices to be multiplied are uniformly distributed to all processors of a hypercube system. Each processor owns some submatrices which are derived by dividing the source matrices. Each submatrix multiplication ca...
Performance of an embedded optical vector matrix multiplication processor architecture
An embedded architecture for an optical vector matrix multiplier (OVMM) is presented. The embedded architecture aims to optimise the data flow of the vector matrix multiplier (VMM) to improve its performance. Data dependence is discussed for the case where the OVMM is connected to a cluster system. A simulator is built to analyse the performance of the architecture. According to the simulation, Amdah...